
In this assignment we will try to predict how car prices change (usually dropping) as vehicles go from first hand to second hand, according to different vehicle characteristics. We will exploit ensemble techniques to obtain models that are as good as possible.
Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.
![]() | You will need to solve a question by writing your own code or answer in the cell immediately below, or in a different file as instructed. Both correctness of the solution and code quality will be taken into account for marking. |
![]() | This is a hint or useful observation that can help you solve this assignment. You are not expected to write any solution, but you should pay attention to them to understand the assignment. |
![]() | This is an advanced and voluntary exercise that can help you gain deeper knowledge of the topic. This exercise won't be taken into account towards marking, but you are encouraged to undertake it. Good luck! |
The following code imports the core libraries and objects used. Feel free to import additional libraries in subsequent cells if you need them:
# Numpy and random seed
import numpy as np
RANDOM_STATE = 42; np.random.seed(RANDOM_STATE)
# Pandas with Holoviews
import holoviews as hv; hv.extension('bokeh')
from holoviews import opts
import hvplot.pandas
import pandas as pd
pd.options.plotting.backend = 'hvplot'
# Inline Matplotlib plotting
import matplotlib.pyplot as plt
%matplotlib inline
# Sklearn base classifiers
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
import warnings
warnings.filterwarnings('ignore')
Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Shift+Tab to produce a pop-out with related documentation. This will only work inside code cells.
Let's go!
In this assignment we will work with the vehicles dataset, which was retrieved from Craigslist:
#load the csv (warnings are already silenced above)
df = pd.read_csv('./data/vehicles/vehicles.csv')
df
| price | year | manufacturer | condition | cylinders | fuel | odometer | title_status | transmission | drive | type | paint_color | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6995 | 2000.0 | gmc | excellent | 8 cylinders | gas | 167783.0 | clean | automatic | 4wd | NaN | red |
| 1 | 8750 | 2013.0 | hyundai | excellent | 4 cylinders | gas | 90821.0 | clean | automatic | fwd | NaN | grey |
| 2 | 10900 | 2013.0 | toyota | good | 4 cylinders | hybrid | 92800.0 | clean | automatic | fwd | NaN | blue |
| 3 | 12500 | 2003.0 | mitsubishi | good | 4 cylinders | gas | NaN | clean | manual | 4wd | sedan | grey |
| 4 | 16995 | 2007.0 | gmc | good | 8 cylinders | diesel | 254217.0 | clean | automatic | 4wd | truck | white |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 423852 | 1600 | 2006.0 | hyundai | fair | 6 cylinders | gas | 159980.0 | clean | automatic | fwd | sedan | blue |
| 423853 | 9000 | 2003.0 | toyota | excellent | 8 cylinders | gas | 160000.0 | clean | automatic | 4wd | SUV | green |
| 423854 | 700 | 1994.0 | ford | fair | 6 cylinders | gas | 212000.0 | clean | manual | rwd | NaN | green |
| 423855 | 3800 | 1999.0 | lincoln | excellent | 8 cylinders | gas | 160000.0 | clean | automatic | rwd | sedan | NaN |
| 423856 | 8650 | 2015.0 | nissan | NaN | NaN | gas | 160526.0 | clean | automatic | fwd | sedan | silver |
423857 rows × 12 columns
Let's take a look at the columns and their meaning:
- `price`: the price (in dollars) the car is listed at.
- `year`: year when the car was originally bought.
- `manufacturer`: the car brand (Audi, BMW...).
- `condition`: car status (like new, in good/bad condition...).
- `cylinders`: how many cylinders the engine has.
- `fuel`: what kind of fuel it uses.
- `odometer`: mileage at the time of sale.
- `title_status`: whether it's original or has been modified with extra pieces.
- `transmission`: type of transmission (manual, automatic...).
- `drive`: kind of drive (forward, rear...).
- `type`: category of car (compact, SUV, coupe...).
- `paint_color`: main color of the car (white, red...).

We have a fairly large dataset (more than 400K rows), but there are quite a lot of missing values! The only column that is always present is `price`:
df.isna().sum(axis=0)
price 0 year 95114 manufacturer 110615 condition 247138 cylinders 226178 fuel 96643 odometer 153272 title_status 96098 transmission 95792 drive 192738 type 182700 paint_color 201654 dtype: int64
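The same inspection can be sanity-checked on a toy frame (a minimal sketch with made-up values; `toy` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset's missing-data pattern
toy = pd.DataFrame({
    'price': [1000, 2000, 3000],
    'year': [2001.0, np.nan, 2010.0],
    'fuel': ['gas', 'gas', None],
})
print(toy.isna().sum(axis=0))  # price 0, year 1, fuel 1
# dropna(how='any') discards every row with at least one missing value
clean = toy.dropna(how='any').reset_index(drop=True)
print(len(clean))  # 1: only the first row is complete
```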
To simplify things, let's simply eliminate all rows that have missing information. This leaves us with barely 86K samples:
df = df.dropna(how='any')
df = df.reset_index(drop=True)
df
| price | year | manufacturer | condition | cylinders | fuel | odometer | title_status | transmission | drive | type | paint_color | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16995 | 2007.0 | gmc | good | 8 cylinders | diesel | 254217.0 | clean | automatic | 4wd | truck | white |
| 1 | 13995 | 2012.0 | ford | good | 6 cylinders | gas | 188406.0 | clean | automatic | 4wd | truck | grey |
| 2 | 7995 | 2010.0 | chevrolet | good | 4 cylinders | gas | 108124.0 | clean | automatic | 4wd | SUV | grey |
| 3 | 8995 | 2011.0 | chevrolet | good | 6 cylinders | gas | 178054.0 | clean | automatic | 4wd | SUV | white |
| 4 | 10995 | 2014.0 | ford | good | 6 cylinders | gas | 170259.0 | clean | automatic | 4wd | SUV | white |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85993 | 9885 | 2012.0 | subaru | excellent | 4 cylinders | gas | 82000.0 | clean | automatic | 4wd | hatchback | silver |
| 85994 | 750 | 2002.0 | saturn | salvage | 4 cylinders | gas | 40000.0 | missing | manual | fwd | coupe | blue |
| 85995 | 4800 | 2002.0 | ford | good | 6 cylinders | gas | 58000.0 | clean | automatic | rwd | coupe | blue |
| 85996 | 1600 | 2006.0 | hyundai | fair | 6 cylinders | gas | 159980.0 | clean | automatic | fwd | sedan | blue |
| 85997 | 9000 | 2003.0 | toyota | excellent | 8 cylinders | gas | 160000.0 | clean | automatic | 4wd | SUV | green |
85998 rows × 12 columns
Now we should also convert year and odometer to integer types. As for cylinders, it can take the value 'other'; for the rest we can strip the word 'cylinders' and keep just the number:
df = df.astype({'price': 'int32', 'year': 'int16', 'odometer': 'int32'})
df['cylinders'] = df['cylinders'].str.replace(' cylinders', '', regex=False)
df
| price | year | manufacturer | condition | cylinders | fuel | odometer | title_status | transmission | drive | type | paint_color | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16995 | 2007 | gmc | good | 8 | diesel | 254217 | clean | automatic | 4wd | truck | white |
| 1 | 13995 | 2012 | ford | good | 6 | gas | 188406 | clean | automatic | 4wd | truck | grey |
| 2 | 7995 | 2010 | chevrolet | good | 4 | gas | 108124 | clean | automatic | 4wd | SUV | grey |
| 3 | 8995 | 2011 | chevrolet | good | 6 | gas | 178054 | clean | automatic | 4wd | SUV | white |
| 4 | 10995 | 2014 | ford | good | 6 | gas | 170259 | clean | automatic | 4wd | SUV | white |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85993 | 9885 | 2012 | subaru | excellent | 4 | gas | 82000 | clean | automatic | 4wd | hatchback | silver |
| 85994 | 750 | 2002 | saturn | salvage | 4 | gas | 40000 | missing | manual | fwd | coupe | blue |
| 85995 | 4800 | 2002 | ford | good | 6 | gas | 58000 | clean | automatic | rwd | coupe | blue |
| 85996 | 1600 | 2006 | hyundai | fair | 6 | gas | 159980 | clean | automatic | fwd | sedan | blue |
| 85997 | 9000 | 2003 | toyota | excellent | 8 | gas | 160000 | clean | automatic | 4wd | SUV | green |
85998 rows × 12 columns
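The `cylinders` cleanup can be verified on a toy Series covering the formats seen in the data; note that `'other'` contains no `' cylinders'` substring, so it passes through untouched:

```python
import pandas as pd

# Toy series with the raw 'cylinders' formats present in the dataset
raw = pd.Series(['8 cylinders', '4 cylinders', 'other'])
cleaned = raw.str.replace(' cylinders', '', regex=False)
print(cleaned.tolist())  # ['8', '4', 'other']
```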
As for the numerical columns, note that there are some extreme values:
- `price`: it shouldn't be 0 (free cars?), but of course it shouldn't be 2.49 billion $ either!
- `year`: there are some really ancient cars (1923), and even a car from the future! (from 2021, perhaps a DeLorean?)
- `odometer`: maybe it can be 0 for a brand-new car, but 10 million miles?

#To make the output easier to read, disable scientific notation formatting
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df.describe(percentiles=[0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99])
| price | year | odometer | |
|---|---|---|---|
| count | 85998.000 | 85998.000 | 85998.000 |
| mean | -7240.961 | 2009.090 | 113764.810 |
| std | 6167590.244 | 8.667 | 124175.071 |
| min | -1804435921.000 | 1923.000 | 0.000 |
| 1% | 0.000 | 1971.000 | 110.000 |
| 5% | 600.000 | 1996.000 | 13077.000 |
| 10% | 2495.000 | 2001.000 | 28000.000 |
| 25% | 4995.000 | 2006.000 | 64491.750 |
| 50% | 8995.000 | 2011.000 | 107960.500 |
| 75% | 16000.000 | 2015.000 | 151217.000 |
| 90% | 26873.000 | 2017.000 | 191600.000 |
| 95% | 34750.000 | 2018.000 | 218990.600 |
| 99% | 49500.000 | 2019.000 | 289003.090 |
| max | 123456789.000 | 2021.000 | 10000000.000 |
So let's do the following:
![]() | Filter out those car samples for which the `price` is below 500 $ or above 50,000 $, the `year` is before 1990 or after 2020, or the `odometer` is 0 or above 300,000 miles. |
#filtering the dataset
df = df.drop(df[(df['price']<500) | (df['price']>50000)].index)
df = df.drop(df[(df['year']<1990) | (df['year']>2020)].index)
df = df.drop(df[(df['odometer']==0) | (df['odometer']>300000)].index)
df
| price | year | manufacturer | condition | cylinders | fuel | odometer | title_status | transmission | drive | type | paint_color | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16995 | 2007 | gmc | good | 8 | diesel | 254217 | clean | automatic | 4wd | truck | white |
| 1 | 13995 | 2012 | ford | good | 6 | gas | 188406 | clean | automatic | 4wd | truck | grey |
| 2 | 7995 | 2010 | chevrolet | good | 4 | gas | 108124 | clean | automatic | 4wd | SUV | grey |
| 3 | 8995 | 2011 | chevrolet | good | 6 | gas | 178054 | clean | automatic | 4wd | SUV | white |
| 4 | 10995 | 2014 | ford | good | 6 | gas | 170259 | clean | automatic | 4wd | SUV | white |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85993 | 9885 | 2012 | subaru | excellent | 4 | gas | 82000 | clean | automatic | 4wd | hatchback | silver |
| 85994 | 750 | 2002 | saturn | salvage | 4 | gas | 40000 | missing | manual | fwd | coupe | blue |
| 85995 | 4800 | 2002 | ford | good | 6 | gas | 58000 | clean | automatic | rwd | coupe | blue |
| 85996 | 1600 | 2006 | hyundai | fair | 6 | gas | 159980 | clean | automatic | fwd | sedan | blue |
| 85997 | 9000 | 2003 | toyota | excellent | 8 | gas | 160000 | clean | automatic | 4wd | SUV | green |
77880 rows × 12 columns
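As an aside, the same filter can be written with `Series.between`, which keeps the thresholds readable; a sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({'price': [100, 9000, 60000],
                    'year': [1985, 2005, 2021],
                    'odometer': [0, 120000, 400000]})
# Keep rows whose values fall inside the accepted ranges (same thresholds as above)
mask = (toy['price'].between(500, 50000)
        & toy['year'].between(1990, 2020)
        & toy['odometer'].between(1, 300000))
print(toy[mask])  # only the middle row survives
```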
df.describe()
| price | year | odometer | |
|---|---|---|---|
| count | 77880.000 | 77880.000 | 77880.000 |
| mean | 12425.825 | 2009.956 | 112367.979 |
| std | 9562.850 | 5.791 | 60041.653 |
| min | 500.000 | 1990.000 | 1.000 |
| 25% | 5500.000 | 2006.000 | 68562.500 |
| 50% | 9500.000 | 2011.000 | 110787.500 |
| 75% | 16500.000 | 2014.000 | 153000.000 |
| max | 50000.000 | 2020.000 | 300000.000 |
#now there are no missing values
df.isna().sum(axis=0)
price 0 year 0 manufacturer 0 condition 0 cylinders 0 fuel 0 odometer 0 title_status 0 transmission 0 drive 0 type 0 paint_color 0 dtype: int64
Now we obtain reasonable distributions for price, year and odometer, except that there aren't many cars from very recent years (which makes sense: these are mostly second-hand cars, and a car needs a few years on the road before it is resold!):
(
df['price'].plot(kind='hist', bins=50, bin_range=(0, 50000)).opts(title='cars by price', xlabel='Price') +
df['year'].plot(kind='hist', bins=30, bin_range=(1990, 2020)).opts(title='cars by year', xlabel='Year') +
df['odometer'].plot(kind='hist', bins=30, bin_range=(0, 300000)).opts(title='cars by odometer', xlabel='Odometer')
).cols(1)
#As we can see, the distributions look quite reasonable. In the "cars by year" histogram there is a dip around 2010, plausibly related to the financial crisis in the United States.
#The same histogram trends downward for the most recent years: new cars need some time before they reappear on the second-hand market, so this is expected rather than an anomaly.
#The "cars by odometer" histogram is also reasonable: most listed cars have around 100,000 miles, and listings decrease as mileage grows, since most high-mileage cars end up in the scrapyard.
![]() | Verify that the rest of features (all categorical) have reasonable values, and check which categories are allowed for each of those features. |
#To verify that the remaining features (all categorical) take reasonable values, I build a bar chart for each one.
#For each feature I count the number of records per category ("feature_counts") and plot the counts with Holoviews.
manufacturer_counts = df['manufacturer'].value_counts().reset_index()
manufacturer_counts.columns = ['manufacturer', 'count']
bars_manufacturer = hv.Bars(manufacturer_counts, ['manufacturer'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_manufacturer
#As we can see, the most listed brands are American ones (Chevrolet, Ford, Jeep), followed by other well-known brands like Toyota and Honda (Japan) and Hyundai (Korea).
#High-end brands such as Porsche, Aston Martin or Tesla have fewer records, either because they sell fewer vehicles or because they are less commonly resold as used cars.
condition_counts = df['condition'].value_counts().reset_index()
condition_counts.columns = ['condition', 'count']
bars_condition = hv.Bars(condition_counts, ['condition'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_condition
#As we can see, the majority of cars are in excellent or good condition.
#This may be because cars in good condition have more chances of being sold, or it could be a marketing strategy.
cylinders_counts = df['cylinders'].value_counts().reset_index()
cylinders_counts.columns = ['cylinders', 'count']
bars_cylinders = hv.Bars(cylinders_counts, ['cylinders'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_cylinders
#The counts per number of cylinders look reasonable.
fuel_counts = df['fuel'].value_counts().reset_index()
fuel_counts.columns = ['fuel', 'count']
bars_fuel = hv.Bars(fuel_counts, ['fuel'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_fuel
#The majority of cars are gasoline-powered.
title_status_counts = df['title_status'].value_counts().reset_index()
title_status_counts.columns = ['title_status', 'count']
bars_title_status = hv.Bars(title_status_counts, ['title_status'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_title_status
#Reasonable values
transmission_counts = df['transmission'].value_counts().reset_index()
transmission_counts.columns = ['transmission', 'count']
bars_transmission = hv.Bars(transmission_counts, ['transmission'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_transmission
#In the US, the most common is that cars have an automatic transmission.
drive_counts = df['drive'].value_counts().reset_index()
drive_counts.columns = ['drive', 'count']
bars_drive = hv.Bars(drive_counts, ['drive'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_drive
#As we can see, front-wheel drive is the most common type. Four-wheel drive also has a large share, since the dataset includes off-road vehicles, SUVs, trucks, vans, etc.
#Lastly, rear-wheel drive is the least common, as it is typically reserved for high-end sports vehicles.
ax = df.groupby(['drive', 'condition']).price.count().unstack().plot(kind='bar', width=700, height=500,
                                                                     title='Number of cars by drive type and condition', rot=45,
                                                                     xlabel='Drive type',
                                                                     ylabel='Number of cars')
ax
type_counts = df['type'].value_counts().reset_index()
type_counts.columns = ['type', 'count']
bars_type = hv.Bars(type_counts, ['type'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_type
paint_color_counts = df['paint_color'].value_counts().reset_index()
paint_color_counts.columns = ['paint_color', 'count']
bars_paint_color = hv.Bars(paint_color_counts, ['paint_color'], ['count']).opts(opts.Bars(width=600, height=400, tools=['hover'], xrotation=90))
bars_paint_color
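The eight nearly identical cells above differ only in the column name, so the counting step could be factored into a small helper (a sketch; `category_counts` is a hypothetical name):

```python
import pandas as pd

def category_counts(df, col):
    """Return a two-column frame with each category of `col` and its count."""
    counts = df[col].value_counts().reset_index()
    counts.columns = [col, 'count']
    return counts

# Example on toy data; the result can be fed to hv.Bars exactly as in the cells above
toy = pd.DataFrame({'fuel': ['gas', 'gas', 'diesel']})
print(category_counts(toy, 'fuel'))
```

Each bar chart then becomes `hv.Bars(category_counts(df, col), [col], ['count'])` inside a loop over the categorical columns.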
Finally, for all those categorical features we need to generate dummy variables (one-hot encoding), so that all features can be treated by models as purely numerical:
#One-hot encoding
df = pd.get_dummies(df)
df
| price | year | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | ... | paint_color_brown | paint_color_custom | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16995 | 2007 | 254217 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 13995 | 2012 | 188406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7995 | 2010 | 108124 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 8995 | 2011 | 178054 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 10995 | 2014 | 170259 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85993 | 9885 | 2012 | 82000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 85994 | 750 | 2002 | 40000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 85995 | 4800 | 2002 | 58000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 85996 | 1600 | 2006 | 159980 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 85997 | 9000 | 2003 | 160000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
77880 rows × 98 columns
So our final dataset has roughly 78K rows (77,880 samples) with 98 columns each.
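The effect of `get_dummies` is easy to see on a toy frame: numeric columns pass through unchanged, while each category of a text column becomes its own 0/1 column named `column_category`:

```python
import pandas as pd

toy = pd.DataFrame({'price': [100, 200], 'fuel': ['gas', 'diesel']})
dummies = pd.get_dummies(toy)
# Numeric 'price' is kept; 'fuel' is expanded into 'fuel_diesel' and 'fuel_gas'
print(list(dummies.columns))
```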
A well-known empirical result (https://www.free-online-calculator-use.com/car-depreciation-calculator.html) is that a car loses roughly 24% of its value during its first year, and about 15% of its remaining value each year thereafter.
This yields the following exponentially decreasing curve for a 30,000 $ car:

In order to check this out in our data, we need to switch from the year column to an age column:
![]() | Calculate the `age` column (using 2020 as the current year), add it to the data and delete the `year` column. |
#Compute each car's age from its model year and replace the year column with it
df['year'] = 2020 - df['year']
df = df.rename(columns={'year': 'age'})
df
| price | age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | ... | paint_color_brown | paint_color_custom | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16995 | 13 | 254217 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 13995 | 8 | 188406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7995 | 10 | 108124 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 8995 | 9 | 178054 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 10995 | 6 | 170259 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85993 | 9885 | 8 | 82000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 85994 | 750 | 18 | 40000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 85995 | 4800 | 18 | 58000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 85996 | 1600 | 14 | 159980 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 85997 | 9000 | 17 | 160000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
77880 rows × 98 columns
Note that we don't know what a car was worth when it was first purchased, only the price it's being sold at now, so we can't compute its depreciation directly. However, ignoring inflation and assuming that cars of similar value are bought every year, we can approximate the result above by grouping by age and averaging the price of all cars of that age:
![]() | Calculate the mean price for each `age`, plot the results and calculate the % of decrease for each consecutive year. Do you obtain a similar result to the above one? |
#I group by age and calculate the average price by age
avg_price_by_age = df.groupby('age')['price'].mean()
avg_price_by_age
age 0 30033.957 1 28542.660 2 25884.679 3 22532.809 4 19179.775 5 18027.772 6 14848.909 7 13234.891 8 12040.311 9 11586.636 10 9387.075 11 8331.043 12 8741.302 13 7490.714 14 7435.246 15 6497.995 16 6310.352 17 5739.641 18 5820.298 19 5513.692 20 5444.955 21 5528.579 22 4920.148 23 5619.883 24 6006.901 25 6739.635 26 6195.187 27 6788.285 28 5825.006 29 6799.551 30 6674.127 Name: price, dtype: float64
plt.plot(avg_price_by_age)
plt.xlabel('Age')
plt.ylabel('Average Price')
plt.title('Average Price by Age')
plt.show()
# Convert the Series 'avg_price_by_age' into a DataFrame
df_avg_price_by_age = avg_price_by_age.reset_index()
# Generate the line chart with hvplot
plot = df_avg_price_by_age.hvplot.line('age', 'price', width=600, height=400)
plot
# Year-over-year depreciation: % change of the mean price between consecutive ages
df_edad_precio = df[['age', 'price']].copy()
df_edad_precio = df_edad_precio.groupby(['age']).mean().reset_index()
df_edad_precio['Depreciación'] = df_edad_precio['price'].pct_change() * 100
#the first row is NaN, since there is no previous year to compare against
df_edad_precio
| age | price | Depreciación | |
|---|---|---|---|
| 0 | 0 | 30033.957 | NaN |
| 1 | 1 | 28542.660 | -4.965 |
| 2 | 2 | 25884.679 | -9.312 |
| 3 | 3 | 22532.809 | -12.949 |
| 4 | 4 | 19179.775 | -14.881 |
| 5 | 5 | 18027.772 | -6.006 |
| 6 | 6 | 14848.909 | -17.633 |
| 7 | 7 | 13234.891 | -10.870 |
| 8 | 8 | 12040.311 | -9.026 |
| 9 | 9 | 11586.636 | -3.768 |
| 10 | 10 | 9387.075 | -18.984 |
| 11 | 11 | 8331.043 | -11.250 |
| 12 | 12 | 8741.302 | 4.924 |
| 13 | 13 | 7490.714 | -14.307 |
| 14 | 14 | 7435.246 | -0.740 |
| 15 | 15 | 6497.995 | -12.606 |
| 16 | 16 | 6310.352 | -2.888 |
| 17 | 17 | 5739.641 | -9.044 |
| 18 | 18 | 5820.298 | 1.405 |
| 19 | 19 | 5513.692 | -5.268 |
| 20 | 20 | 5444.955 | -1.247 |
| 21 | 21 | 5528.579 | 1.536 |
| 22 | 22 | 4920.148 | -11.005 |
| 23 | 23 | 5619.883 | 14.222 |
| 24 | 24 | 6006.901 | 6.887 |
| 25 | 25 | 6739.635 | 12.198 |
| 26 | 26 | 6195.187 | -8.078 |
| 27 | 27 | 6788.285 | 9.574 |
| 28 | 28 | 5825.006 | -14.190 |
| 29 | 29 | 6799.551 | 16.730 |
| 30 | 30 | 6674.127 | -1.845 |
#keep only the rows up to age 20, to compare with the reference curve above
df_edad_precio = df_edad_precio.drop(range(21, 31),axis=0)
df_edad_precio
| age | price | Depreciación | |
|---|---|---|---|
| 0 | 0 | 30033.957 | NaN |
| 1 | 1 | 28542.660 | -4.965 |
| 2 | 2 | 25884.679 | -9.312 |
| 3 | 3 | 22532.809 | -12.949 |
| 4 | 4 | 19179.775 | -14.881 |
| 5 | 5 | 18027.772 | -6.006 |
| 6 | 6 | 14848.909 | -17.633 |
| 7 | 7 | 13234.891 | -10.870 |
| 8 | 8 | 12040.311 | -9.026 |
| 9 | 9 | 11586.636 | -3.768 |
| 10 | 10 | 9387.075 | -18.984 |
| 11 | 11 | 8331.043 | -11.250 |
| 12 | 12 | 8741.302 | 4.924 |
| 13 | 13 | 7490.714 | -14.307 |
| 14 | 14 | 7435.246 | -0.740 |
| 15 | 15 | 6497.995 | -12.606 |
| 16 | 16 | 6310.352 | -2.888 |
| 17 | 17 | 5739.641 | -9.044 |
| 18 | 18 | 5820.298 | 1.405 |
| 19 | 19 | 5513.692 | -5.268 |
| 20 | 20 | 5444.955 | -1.247 |
#we remove age 0, since its depreciation is NaN
media_df_edad_precio = df_edad_precio.drop(range(0, 1), axis=0)
media_df_edad_precio
| age | price | Depreciación | |
|---|---|---|---|
| 1 | 1 | 28542.660 | -4.965 |
| 2 | 2 | 25884.679 | -9.312 |
| 3 | 3 | 22532.809 | -12.949 |
| 4 | 4 | 19179.775 | -14.881 |
| 5 | 5 | 18027.772 | -6.006 |
| 6 | 6 | 14848.909 | -17.633 |
| 7 | 7 | 13234.891 | -10.870 |
| 8 | 8 | 12040.311 | -9.026 |
| 9 | 9 | 11586.636 | -3.768 |
| 10 | 10 | 9387.075 | -18.984 |
| 11 | 11 | 8331.043 | -11.250 |
| 12 | 12 | 8741.302 | 4.924 |
| 13 | 13 | 7490.714 | -14.307 |
| 14 | 14 | 7435.246 | -0.740 |
| 15 | 15 | 6497.995 | -12.606 |
| 16 | 16 | 6310.352 | -2.888 |
| 17 | 17 | 5739.641 | -9.044 |
| 18 | 18 | 5820.298 | 1.405 |
| 19 | 19 | 5513.692 | -5.268 |
| 20 | 20 | 5444.955 | -1.247 |
media_df_edad_precio['Depreciación'].mean()
-7.970666440655506
df_edad_precio.groupby(['age'])['price'].mean().plot(color = "#FF3364",
                                                     fontsize = 12,
                                                     title = 'Average price by age',
                                                     xlabel = 'Age (years)',
                                                     ylabel = 'Mean price'
)
#The results aren't quite the same: in the first year we observe roughly a 5% depreciation, in contrast with the 24% shown in the reference curve.
#For the remaining years the depreciation is about 8% per year, noticeably less than the 15% observed in the initial study.
Since age is quite correlated with odometer (the older a car is, the more mileage it tends to have), the above can be repeated with odometer:
![]() | Calculate the mean price for `odometer` (group by multiples of 10000 miles), plot the results and calculate the % of decrease for each consecutive group. Do you obtain a similar depreciation curve? |
![]() | As Pandas documentation clarifies, in order to group by a function on a column, you need to set first that column as the index with `set_index`. |
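Following the hint, the same 10,000-mile grouping can also be done by setting `odometer` as the index and grouping by a function of it; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'odometer': [5000, 12000, 18000],
                    'price': [100, 200, 300]})
# Group rows by which 10,000-mile bucket their odometer value falls in
buckets = toy.set_index('odometer').groupby(lambda m: m // 10000)['price'].mean()
print(buckets)  # bucket 0 -> 100.0, bucket 1 -> 250.0
```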
#I define the 'odometer' intervals
bins = pd.interval_range(start=0, end=300000, freq=10000)
# Now I can group by the intervals of 'odometer' and calculate the average of 'price' in each group
avg_price_by_odometer = df.groupby(pd.cut(df['odometer'], bins=bins))['price'].mean().reset_index()
avg_price_by_odometer
| odometer | price | |
|---|---|---|
| 0 | (0, 10000] | 20135.979 |
| 1 | (10000, 20000] | 27016.921 |
| 2 | (20000, 30000] | 24424.419 |
| 3 | (30000, 40000] | 22328.630 |
| 4 | (40000, 50000] | 19585.644 |
| 5 | (50000, 60000] | 17855.372 |
| 6 | (60000, 70000] | 16129.750 |
| 7 | (70000, 80000] | 14347.963 |
| 8 | (80000, 90000] | 13374.957 |
| 9 | (90000, 100000] | 12377.919 |
| 10 | (100000, 110000] | 11225.052 |
| 11 | (110000, 120000] | 10032.653 |
| 12 | (120000, 130000] | 9740.624 |
| 13 | (130000, 140000] | 8696.668 |
| 14 | (140000, 150000] | 8157.865 |
| 15 | (150000, 160000] | 7782.058 |
| 16 | (160000, 170000] | 7687.409 |
| 17 | (170000, 180000] | 6990.205 |
| 18 | (180000, 190000] | 7421.874 |
| 19 | (190000, 200000] | 6474.934 |
| 20 | (200000, 210000] | 6445.087 |
| 21 | (210000, 220000] | 5699.513 |
| 22 | (220000, 230000] | 5847.221 |
| 23 | (230000, 240000] | 6442.864 |
| 24 | (240000, 250000] | 6098.073 |
| 25 | (250000, 260000] | 5914.945 |
| 26 | (260000, 270000] | 6985.744 |
| 27 | (270000, 280000] | 7474.760 |
| 28 | (280000, 290000] | 5290.119 |
| 29 | (290000, 300000] | 5959.919 |
type(avg_price_by_odometer['odometer'])
pandas.core.series.Series
plt.plot(avg_price_by_odometer.index,avg_price_by_odometer['price'])
plt.xlabel('Odometer Bins')
plt.ylabel('Average Price')
plt.title('Average Price by Odometer')
plt.show()
# Calculate the change in average price from one interval to the next. This will be the depreciation in the price for each interval.
avg_price_by_odometer['depreciation_per_interval'] = avg_price_by_odometer['price'].diff()
# Calculate depreciation in percentage terms
avg_price_by_odometer['depreciation_percentage'] = avg_price_by_odometer['depreciation_per_interval'] / avg_price_by_odometer['price'].shift() * 100
avg_price_by_odometer
| odometer | price | depreciation_per_interval | depreciation_percentage | |
|---|---|---|---|---|
| 0 | (0, 10000] | 20135.979 | NaN | NaN |
| 1 | (10000, 20000] | 27016.921 | 6880.942 | 34.172 |
| 2 | (20000, 30000] | 24424.419 | -2592.503 | -9.596 |
| 3 | (30000, 40000] | 22328.630 | -2095.789 | -8.581 |
| 4 | (40000, 50000] | 19585.644 | -2742.986 | -12.285 |
| 5 | (50000, 60000] | 17855.372 | -1730.272 | -8.834 |
| 6 | (60000, 70000] | 16129.750 | -1725.621 | -9.664 |
| 7 | (70000, 80000] | 14347.963 | -1781.788 | -11.047 |
| 8 | (80000, 90000] | 13374.957 | -973.005 | -6.781 |
| 9 | (90000, 100000] | 12377.919 | -997.039 | -7.455 |
| 10 | (100000, 110000] | 11225.052 | -1152.867 | -9.314 |
| 11 | (110000, 120000] | 10032.653 | -1192.399 | -10.623 |
| 12 | (120000, 130000] | 9740.624 | -292.028 | -2.911 |
| 13 | (130000, 140000] | 8696.668 | -1043.956 | -10.718 |
| 14 | (140000, 150000] | 8157.865 | -538.804 | -6.196 |
| 15 | (150000, 160000] | 7782.058 | -375.806 | -4.607 |
| 16 | (160000, 170000] | 7687.409 | -94.649 | -1.216 |
| 17 | (170000, 180000] | 6990.205 | -697.204 | -9.069 |
| 18 | (180000, 190000] | 7421.874 | 431.669 | 6.175 |
| 19 | (190000, 200000] | 6474.934 | -946.940 | -12.759 |
| 20 | (200000, 210000] | 6445.087 | -29.847 | -0.461 |
| 21 | (210000, 220000] | 5699.513 | -745.574 | -11.568 |
| 22 | (220000, 230000] | 5847.221 | 147.708 | 2.592 |
| 23 | (230000, 240000] | 6442.864 | 595.643 | 10.187 |
| 24 | (240000, 250000] | 6098.073 | -344.791 | -5.352 |
| 25 | (250000, 260000] | 5914.945 | -183.128 | -3.003 |
| 26 | (260000, 270000] | 6985.744 | 1070.799 | 18.103 |
| 27 | (270000, 280000] | 7474.760 | 489.017 | 7.000 |
| 28 | (280000, 290000] | 5290.119 | -2184.641 | -29.227 |
| 29 | (290000, 300000] | 5959.919 | 669.801 | 12.661 |
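The manual `diff() / shift()` computation above is equivalent to `pct_change()`, which was used for the age-based table; this can be verified on a toy series:

```python
import pandas as pd

s = pd.Series([100.0, 90.0, 81.0])   # a constant 10% drop per step
manual = s.diff() / s.shift() * 100  # explicit relative change
auto = s.pct_change() * 100          # same result via pct_change
print(manual.tolist())  # [nan, -10.0, -10.0]
print(auto.tolist())
```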
#Remove the first interval
#It has no depreciation value, since there is no previous interval to compare against.
avg_price_by_odometer = avg_price_by_odometer.drop(range(0, 1),axis=0)
avg_price_by_odometer
| odometer | price | depreciation_per_interval | depreciation_percentage | |
|---|---|---|---|---|
| 1 | (10000, 20000] | 27016.921 | 6880.942 | 34.172 |
| 2 | (20000, 30000] | 24424.419 | -2592.503 | -9.596 |
| 3 | (30000, 40000] | 22328.630 | -2095.789 | -8.581 |
| 4 | (40000, 50000] | 19585.644 | -2742.986 | -12.285 |
| 5 | (50000, 60000] | 17855.372 | -1730.272 | -8.834 |
| 6 | (60000, 70000] | 16129.750 | -1725.621 | -9.664 |
| 7 | (70000, 80000] | 14347.963 | -1781.788 | -11.047 |
| 8 | (80000, 90000] | 13374.957 | -973.005 | -6.781 |
| 9 | (90000, 100000] | 12377.919 | -997.039 | -7.455 |
| 10 | (100000, 110000] | 11225.052 | -1152.867 | -9.314 |
| 11 | (110000, 120000] | 10032.653 | -1192.399 | -10.623 |
| 12 | (120000, 130000] | 9740.624 | -292.028 | -2.911 |
| 13 | (130000, 140000] | 8696.668 | -1043.956 | -10.718 |
| 14 | (140000, 150000] | 8157.865 | -538.804 | -6.196 |
| 15 | (150000, 160000] | 7782.058 | -375.806 | -4.607 |
| 16 | (160000, 170000] | 7687.409 | -94.649 | -1.216 |
| 17 | (170000, 180000] | 6990.205 | -697.204 | -9.069 |
| 18 | (180000, 190000] | 7421.874 | 431.669 | 6.175 |
| 19 | (190000, 200000] | 6474.934 | -946.940 | -12.759 |
| 20 | (200000, 210000] | 6445.087 | -29.847 | -0.461 |
| 21 | (210000, 220000] | 5699.513 | -745.574 | -11.568 |
| 22 | (220000, 230000] | 5847.221 | 147.708 | 2.592 |
| 23 | (230000, 240000] | 6442.864 | 595.643 | 10.187 |
| 24 | (240000, 250000] | 6098.073 | -344.791 | -5.352 |
| 25 | (250000, 260000] | 5914.945 | -183.128 | -3.003 |
| 26 | (260000, 270000] | 6985.744 | 1070.799 | 18.103 |
| 27 | (270000, 280000] | 7474.760 | 489.017 | 7.000 |
| 28 | (280000, 290000] | 5290.119 | -2184.641 | -29.227 |
| 29 | (290000, 300000] | 5959.919 | 669.801 | 12.661 |
# Keep only the intervals up to 160,000 miles, where depreciation is still steady
avg_price_by_odometer_depr = avg_price_by_odometer.drop(range(16, 30), axis=0)
avg_price_by_odometer_depr
| odometer | price | depreciation_per_interval | depreciation_percentage | |
|---|---|---|---|---|
| 1 | (10000, 20000] | 27016.921 | 6880.942 | 34.172 |
| 2 | (20000, 30000] | 24424.419 | -2592.503 | -9.596 |
| 3 | (30000, 40000] | 22328.630 | -2095.789 | -8.581 |
| 4 | (40000, 50000] | 19585.644 | -2742.986 | -12.285 |
| 5 | (50000, 60000] | 17855.372 | -1730.272 | -8.834 |
| 6 | (60000, 70000] | 16129.750 | -1725.621 | -9.664 |
| 7 | (70000, 80000] | 14347.963 | -1781.788 | -11.047 |
| 8 | (80000, 90000] | 13374.957 | -973.005 | -6.781 |
| 9 | (90000, 100000] | 12377.919 | -997.039 | -7.455 |
| 10 | (100000, 110000] | 11225.052 | -1152.867 | -9.314 |
| 11 | (110000, 120000] | 10032.653 | -1192.399 | -10.623 |
| 12 | (120000, 130000] | 9740.624 | -292.028 | -2.911 |
| 13 | (130000, 140000] | 8696.668 | -1043.956 | -10.718 |
| 14 | (140000, 150000] | 8157.865 | -538.804 | -6.196 |
| 15 | (150000, 160000] | 7782.058 | -375.806 | -4.607 |
avg_price_by_odometer_depr['depreciation_percentage'].mean()
# The average depreciation over the remaining intervals differs slightly from the previous one.
-5.6291535703642115
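A roughly constant percentage drop per interval is what geometric (exponential) depreciation looks like. As a sanity check, the constant per-interval rate implied by the first and last retained averages can be derived directly (the two prices are copied from the table above):

```python
# Average prices taken from the table above (first and last retained intervals)
p_first = 27016.921   # (10000, 20000] miles
p_last = 7782.058     # (150000, 160000] miles
n_steps = 14          # number of 10k-mile intervals between them

# Implied constant per-interval rate under geometric decay:
# p_last = p_first * (1 + r) ** n_steps
r = (p_last / p_first) ** (1 / n_steps) - 1
print(f"implied per-interval rate: {r:.2%}")
```

This implied rate (about -8.5%) is steeper than the plain mean of the percentage column, because that mean is pulled up by the positive +34% outlier in the first retained row.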
Fair enough: this confirms that the cleaned data follow the expected depreciation trend. Now that we can trust the data, it's time to split them into training and testing sets:
![]() | Divide `df` into: * `X` (information of cars excluding `price`). * `Y` (`price` of those cars). Use 80% of cars for training and the remaining 20% for testing. **Remember to use `RANDOM_STATE`** so that this division is always the same, no matter how many times you rerun your notebook. |
# All columns except price will be the independent variables
df = df.reset_index(drop=True)
x = df.loc[:, df.columns != 'price']
# The price column will be the target variable
y = df.loc[:, df.columns == 'price']
x
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_brown | paint_color_custom | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 254217 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 8 | 188406 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 10 | 108124 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | 178054 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 6 | 170259 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 77875 | 8 | 82000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 77876 | 18 | 40000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 77877 | 18 | 58000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 77878 | 14 | 159980 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 77879 | 17 | 160000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
77880 rows × 97 columns
y
| price | |
|---|---|
| 0 | 16995 |
| 1 | 13995 |
| 2 | 7995 |
| 3 | 8995 |
| 4 | 10995 |
| ... | ... |
| 77875 | 9885 |
| 77876 | 750 |
| 77877 | 4800 |
| 77878 | 1600 |
| 77879 | 9000 |
77880 rows × 1 columns
# Train-test split with a fixed seed for reproducibility
train, test = train_test_split(df,
                               train_size=0.80,
                               random_state=RANDOM_STATE)
target = 'price'
indp_columns = [col for col in df.columns if col != 'price']
x_train = train[indp_columns]
y_train = train[target]
x_test = test[indp_columns]
y_test = test[target]
x_train
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_brown | paint_color_custom | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29645 | 4 | 72500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 68355 | 17 | 136000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 52304 | 8 | 109526 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 62400 | 14 | 184850 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11259 | 4 | 70000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6265 | 2 | 58457 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 54886 | 15 | 215000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 76820 | 13 | 138600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 860 | 7 | 80000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 15795 | 3 | 53171 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
62304 rows × 97 columns
y_train
29645 24495
68355 6500
52304 7300
62400 3995
11259 31900
...
6265 12900
54886 8995
76820 3300
860 8299
15795 40990
Name: price, Length: 62304, dtype: int32
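As a side note, `train_test_split` can also take the features and target directly and return the four splits in one call, which keeps rows aligned by construction. A minimal sketch on a synthetic stand-in for the real dataset (`df_demo` and its values are illustrative, not the notebook's `df`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Illustrative stand-in for the real dataset (column names match the notebook)
df_demo = pd.DataFrame({'age': [13, 8, 10, 9, 6, 2, 15, 13, 7, 3],
                        'odometer': [254217, 188406, 108124, 178054, 170259,
                                     58457, 215000, 138600, 80000, 53171],
                        'price': [16995, 13995, 7995, 8995, 10995,
                                  12900, 8995, 3300, 8299, 40990]})

x = df_demo.drop(columns='price')  # features
y = df_demo['price']               # target

# Splitting x and y together keeps their rows aligned by construction
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.80, random_state=RANDOM_STATE)

print(len(x_train), len(x_test))  # 8 2
```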
With the data split, we can now train a very basic model:
![]() | Use `age` to train a linear regression model, that is, find values for `w` and `b` such that the estimated price is $\hat{y} = w·age + b$. Which values do you obtain for `w` and `b`? Can you explain in plain words what this model is doing? Do you see any obvious problem if we wanted to use this model in practice? |
![]() | Check the documentation of LinearRegression to know how to obtain `w` and `b`. |
# Linear regression on age only
lr = LinearRegression(fit_intercept=True)
# Call fit to train the regression
lr.fit(x_train[['age']], y_train)
LinearRegression()
print("Intercept: " + str(lr.intercept_))
print("(w): " + str(lr.coef_))
# y = -962.26429718 * age + 22095.424788176766
# w < 0: the fitted line is decreasing, i.e. the predicted price falls with age
Intercept: 22095.42478817674 (w): [-962.26429718]
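Reading the fitted line: each extra year of age takes roughly $962 off the predicted price, starting from about $22,095 for a brand-new car. A quick evaluation by hand, using the coefficients printed above:

```python
# Coefficients printed by the fitted model above
w = -962.26429718
b = 22095.424788176766

# Predicted price of a 10-year-old car
print(round(w * 10 + b, 2))   # 12472.78

# Age at which the prediction crosses zero
print(round(-b / w, 1))       # ~23.0 years
```

Predicting negative prices for cars older than roughly 23 years is the obvious problem if this model were used in practice.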
año_precio = train.groupby(['age'])['price'].mean()
año_precio = año_precio.values
train_Pmedio = pd.DataFrame()
train_Pmedio['Edad'] = np.arange(0, 31)  # ages 0..30 in this dataset
train_Pmedio['Precio'] = año_precio
x_train_Pmedio = train_Pmedio['Edad'].values
y_train_Pmedio = train_Pmedio['Precio'].values
# Coefficients taken from the fitted model
w = lr.coef_[0]
b = lr.intercept_
plt.figure(figsize=(8, 6))
plt.scatter(x_train_Pmedio, y_train_Pmedio, c='#33FFB2', s=60, label='Actual mean prices')
plt.plot(x_train_Pmedio, w * x_train_Pmedio + b, color='#FF3352', label='Linear regression')
plt.title("Linear regression vs actual values", fontsize=16)
plt.xlabel("Age", fontsize=10)
plt.ylabel("Price", fontsize=10)
plt.legend();
# Linear regression line plotted against the mean actual price per age
And let's check it on the test set:
![]() | Use the linear regression model to predict for `X_test` cars. What score do you obtain? How does this compare to `X_train` cars? Why is this? |
![]() | A good way to grasp what is going on is to plot with two different line plots: * What the model says for a given `age`. * The average price in the dataset for that `age`. Do this either for `X_train` and `X_test` separately, or for the whole dataset at once. |
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Test-set error metrics
y_pred_test = lr.predict(x_test[['age']])
mse = mean_squared_error(y_true=y_test, y_pred=y_pred_test)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true=y_test, y_pred=y_pred_test)
print("Train score (R2): ", lr.score(x_train[['age']], y_train))
print("MSE test: " + str(mse))
print("RMSE test: " + str(rmse))
print("MAE test: " + str(mae))
Train score (R2): 0.3399370598042252 MSE test: 60487159.02335366 RMSE test: 7777.349100005326 MAE test: 5876.8384209432525
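To answer the comparison question directly, `LinearRegression.score` returns R² for whichever split you pass it, so the same call on both splits quantifies any overfitting. A self-contained sketch on synthetic data standing in for the real `age`/`price` columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price falls roughly linearly with age, plus noise
rng = np.random.default_rng(42)
age = rng.integers(0, 30, size=500).reshape(-1, 1)
price = 22000 - 950 * age.ravel() + rng.normal(0, 4000, size=500)

x_tr, x_te, y_tr, y_te = train_test_split(age, price, train_size=0.8,
                                          random_state=42)
lr_demo = LinearRegression().fit(x_tr, y_tr)

# R^2 on both splits; similar values suggest the model is not overfitting
print(f"train R2: {lr_demo.score(x_tr, y_tr):.3f}")
print(f"test  R2: {lr_demo.score(x_te, y_te):.3f}")
```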
prediccion = pd.DataFrame({'age': np.arange(0, 31)})
prediccion['price'] = lr.predict(prediccion[['age']])
linea1 = df.groupby(['age'])['price'].mean().plot(x='age', y='price', color="#FF3333",
                                                  fontsize=12,
                                                  title='Prediction vs average price by age',
                                                  xlabel='Age',
                                                  ylabel='Price')
linea2 = prediccion.plot(x='age', y='price', kind='line', color="#93FF33")
linea1 * linea2
# The fit is poor, especially at the right end of the line, where it even predicts negative prices
Let's repeat this with odometer:
![]() | Use `odometer` to train another linear regression model, that is, find values for `w` and `b` such that the estimated price is $\hat{y} = w·odometer + b$. Which values do you obtain now for `w` and `b`? Interpret the resulting model, and see if it behaves significantly better/worse for `X_test`. Why is this if `odometer` is very correlated with `age`? |
# Linear regression on odometer only (x_train and y_train were already built above)
lr_millas = LinearRegression(fit_intercept=True)
lr_millas.fit(x_train[['odometer']], y_train)
LinearRegression()
print("Intercept " + str(lr_millas.intercept_))
print("(w): " + str(lr_millas.coef_))
# Intercept: 22022.01
# w: -0.085
# A decreasing linear function: each additional mile takes about $0.085 off the predicted price.
Intercept 22022.011181975868 (w): [-0.08524899]
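In more tangible units, a slope of about -0.085 $/mile translates into roughly $850 of predicted depreciation per 10,000 miles. Using the coefficients printed above:

```python
# Coefficients printed by the fitted model above
w = -0.08524899
b = 22022.011181975868

# Predicted price drop over 10,000 miles
print(round(w * 10_000, 2))       # -852.49

# Predicted price at 100,000 miles
print(round(w * 100_000 + b, 2))  # 13497.11
```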
One of the problems with the above linear regressions is that we are taking into account just one variable (either age or odometer) to predict. But there are many more features in the dataset, including all the categorical ones. Let's fit a tree to see what happens:
![]() | Train a decision tree with all features (use `max_depth=5` to try to prevent overfitting) and store it in a variable called `dt`. Do you obtain better predictions now in test? |
target = 'price'
indp_columns = [col for col in df.columns if col != 'price']
X_train = train[indp_columns]
y_train = train[target]
X_test = test[indp_columns]
y_test = test[target]
# Decision tree regressor (note: criterion "mse" was renamed "squared_error" in sklearn >= 1.0)
dt = DecisionTreeRegressor(criterion="mse",
                           max_depth=5,
                           min_samples_split=5,
                           min_samples_leaf=5
                           )
# fit:
dt.fit(X_train, y_train)
# metrics:
mse_dt = mean_squared_error(y_true = y_test, y_pred = dt.predict(X_test))
rmse_dt = np.sqrt(mse_dt)
print("MSE: " + str(mse_dt))
print("RMSE: " + str(rmse_dt))
# RMSE drops by roughly $2,450 relative to the age-only linear regression,
# so the tree clearly improves on the single-variable linear models
MSE: 28367308.870631143 RMSE: 5326.0969640658195
One of the advantages of decision trees is that their nodes can be plotted to see how the tree decides to split. This can be done using the plot_tree utility function. For example, the first levels can be shown as follows:
_, ax = plt.subplots(figsize=(20, 6))
plot_tree(dt, max_depth=2, ax=ax, feature_names=X_train.columns, impurity=False, fontsize=16);
Which shows that the price prediction (value) differs notably depending on the branch. The features most often selected for splitting are reflected in the feature_importances_ field:
![]() | Plot the importance of the 10 most important features. Which are those features? Does this make sense according to your intuition? |
![]() | You may find the `argsort` function useful when trying to obtain the most important features. Once you have them, you can use a standard bar plot with feature names as `x` axis and importances as `y` axis. |
dt.feature_importances_
array([0.59090243, 0.05797833, 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0.00101937,
0.15162408, 0. , 0.01146445, 0.02964704, 0. ,
0.04182367, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.10835136, 0.00535401, 0.00183525, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. ])
orden = dt.feature_importances_.argsort()
orden
array([48, 68, 67, 66, 65, 64, 63, 62, 61, 69, 60, 58, 57, 56, 54, 51, 95,
47, 46, 59, 73, 74, 75, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84,
83, 82, 81, 80, 79, 78, 77, 76, 45, 44, 96, 42, 43, 22, 21, 20, 19,
18, 17, 16, 15, 3, 14, 12, 11, 10, 9, 8, 7, 6, 5, 4, 13, 24,
23, 26, 41, 40, 39, 38, 37, 25, 35, 34, 36, 32, 31, 30, 29, 28, 27,
33, 2, 49, 72, 71, 52, 53, 55, 1, 70, 50, 0], dtype=int64)
features_importances = pd.DataFrame(columns = X_train.columns[orden])
features_importances
| cylinders_12 | transmission_other | transmission_manual | transmission_automatic | title_status_salvage | title_status_rebuilt | title_status_parts only | title_status_missing | title_status_lien | drive_4wd | ... | cylinders_3 | type_SUV | drive_rwd | cylinders_6 | cylinders_8 | fuel_diesel | odometer | drive_fwd | cylinders_4 | age |
|---|
0 rows × 97 columns
features_importances.loc[0]=dt.feature_importances_[orden]
features_importances
| cylinders_12 | transmission_other | transmission_manual | transmission_automatic | title_status_salvage | title_status_rebuilt | title_status_parts only | title_status_missing | title_status_lien | drive_4wd | ... | cylinders_3 | type_SUV | drive_rwd | cylinders_6 | cylinders_8 | fuel_diesel | odometer | drive_fwd | cylinders_4 | age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | ... | 0.001 | 0.002 | 0.005 | 0.011 | 0.030 | 0.042 | 0.058 | 0.108 | 0.152 | 0.591 |
1 rows × 97 columns
Top_fi = features_importances.drop(features_importances.columns[0: 87], axis='columns')
Top_fi
| cylinders_3 | type_SUV | drive_rwd | cylinders_6 | cylinders_8 | fuel_diesel | odometer | drive_fwd | cylinders_4 | age | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001 | 0.002 | 0.005 | 0.011 | 0.030 | 0.042 | 0.058 | 0.108 | 0.152 | 0.591 |
Top_fi.plot.bar(fontsize=12,
                title='Top 10 features by importance',
                xlabel='Feature',
                ylabel='Importance')
# age dominates, followed by cylinders_4, drive_fwd and odometer
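The intermediate one-row DataFrame above works, but a pandas `Series` indexed by feature name makes the same top-10 selection shorter. A sketch on a small synthetic fit (the feature names and data here are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Small synthetic stand-in for (X_train, y_train)
rng = np.random.default_rng(42)
X = pd.DataFrame({'age': rng.integers(0, 30, 200),
                  'odometer': rng.integers(0, 300_000, 200),
                  'cylinders_4': rng.integers(0, 2, 200)})
y = 22000 - 900 * X['age'] - 0.05 * X['odometer'] + rng.normal(0, 500, 200)

dt_demo = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X, y)

# Importances as a named Series: nlargest does the sorting and slicing in one step
importances = pd.Series(dt_demo.feature_importances_, index=X.columns)
top = importances.nlargest(10).sort_values()  # ascending reads better in a bar plot
print(top)
# top.plot.barh(title='Top feature importances')
```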
The model in dt should have convinced you that trees show predictive power for this dataset, but that a single tree is not enough. Time now to try tree ensembles:
![]() | Train tree ensembles on this dataset. Feel free to edit this notebook and to try different: * Techniques (bagging, boosting, ...). * Implementations (sklearn, xgb, ...). * Parameter selections (number of trees, depth of trees, ...). Explain your approach. What is the best model that you can come up with? |
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
rf_reg = RandomForestRegressor(random_state = RANDOM_STATE)
extree_reg = ExtraTreesRegressor(random_state = RANDOM_STATE)
ab_reg = AdaBoostRegressor(random_state = RANDOM_STATE)
gb_reg = GradientBoostingRegressor(random_state = RANDOM_STATE)
lgb_reg = LGBMRegressor(random_state = RANDOM_STATE)
cb_reg = CatBoostRegressor(random_state = RANDOM_STATE)
xgb_reg = XGBRegressor(random_state = RANDOM_STATE)
modelos = [
    ('rf_reg', RandomForestRegressor()), ('extree_reg', ExtraTreesRegressor()), ('ab_reg', AdaBoostRegressor()),
    ('gb_reg', GradientBoostingRegressor()), ('lgb_reg', LGBMRegressor()),
    ('cb_reg', CatBoostRegressor()), ('xgb_reg', XGBRegressor())]
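One way to put a list like `modelos` to use is a quick cross-validated comparison before committing to expensive grid searches. A hedged sketch using only the sklearn models on synthetic data (the LightGBM/CatBoost/XGBoost regressors plug into the same loop):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)

# Synthetic stand-in for (X_train, y_train)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(0, 0.5, 300)

modelos_demo = [('rf_reg', RandomForestRegressor(n_estimators=50,
                                                 random_state=RANDOM_STATE)),
                ('gb_reg', GradientBoostingRegressor(random_state=RANDOM_STATE))]

# Mean 5-fold R2 for each candidate: a cheap first filter before tuning
for nombre, modelo in modelos_demo:
    scores = cross_val_score(modelo, X, y, scoring='r2', cv=5)
    print(f"{nombre}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```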
rf_reg = RandomForestRegressor(n_estimators=300,
                               max_depth=30,
                               min_samples_split=10,
                               ccp_alpha=0.1,
                               min_samples_leaf=5)
rf_reg.fit(X_train, y_train)
print('score train: ', rf_reg.score(X_train, y_train))
print('score test: ', rf_reg.score(X_test, y_test))
#Overfitting
score train: 0.9261437802352984 score test: 0.8704634725386805
rf_reg2 = RandomForestRegressor(n_estimators = 150,
max_depth = 30,
max_features = "sqrt",
ccp_alpha = 0.1,
min_samples_split = 10,
min_samples_leaf = 5
)
rf_reg2.fit(X_train, y_train)
print('score train: ', rf_reg2.score (X_train, y_train))
print('score test: ', rf_reg2.score (X_test, y_test))
# The train-test gap narrows (less overfitting than before), but the test score also drops
score train: 0.8523821212629688 score test: 0.8308471404809338
rf_reg3 = RandomForestRegressor(n_estimators = 150,
max_depth = 10,
max_features = "sqrt",
ccp_alpha = 0.1,
min_samples_split = 10,
min_samples_leaf = 5
)
rf_reg3.fit(X_train, y_train)
print('score train: ', rf_reg3.score (X_train, y_train))
print('score test: ', rf_reg3.score (X_test, y_test))
score train: 0.7471418082581973 score test: 0.7375122425853795
extree_reg = ExtraTreesRegressor (n_estimators = 300,
max_depth = 30,
max_features = "sqrt",
min_samples_split = 10,
ccp_alpha = 0.1,
min_samples_leaf = 5)
extree_reg.fit(X_train, y_train)
print('score train: ', extree_reg.score (X_train, y_train))
print('score test: ', extree_reg.score (X_test, y_test))
score train: 0.8020990473006532 score test: 0.7788793049512839
extree_reg2 = ExtraTreesRegressor(n_estimators = 150,
max_depth = 30,
max_features = "sqrt",
ccp_alpha = 0.1,
min_samples_split = 2,
min_samples_leaf = 1
)
extree_reg2.fit(X_train, y_train)
print('score train: ', extree_reg2.score (X_train, y_train))
print('score test: ', extree_reg2.score (X_test, y_test))
# Relaxing min_samples_split and min_samples_leaf lets the trees grow deeper, so overfitting increases
score train: 0.9760695664157354 score test: 0.8754761478170683
grid_ab_reg = {"n_estimators": [150, 300, 500],
"learning_rate": [1.0]
}
gs_ab_reg = GridSearchCV(estimator = ab_reg,
param_grid = grid_ab_reg,
scoring = 'r2',
n_jobs=-1,
cv=10,
verbose=1)
gs_ab_reg.fit(X_train, y_train)
print('Mejor selección: ', gs_ab_reg.best_estimator_)
print('Score train: ', gs_ab_reg.best_score_)
Fitting 10 folds for each of 3 candidates, totalling 30 fits Mejor selección: AdaBoostRegressor(n_estimators=150, random_state=42) Score train: 0.5308523379539066
grid_gb_reg = {"learning_rate": [0.1, 0.5],
               "n_estimators": [50, 100],
               "min_samples_split": [2, 5],  # must be an int >= 2
               "min_samples_leaf": [2, 4],
               "max_depth": [1, 5, 9],
               }
# GridSearch:
gs_gb_reg = GridSearchCV(estimator = gb_reg,
param_grid = grid_gb_reg,
scoring = 'r2',
n_jobs=-1,
cv=10,
verbose=1)
gs_gb_reg.fit(X_train, y_train)
print('Mejor selección: ', gs_gb_reg.best_estimator_)
print('Score train: ', gs_gb_reg.best_score_)
Fitting 10 folds for each of 48 candidates, totalling 480 fits
Mejor selección: GradientBoostingRegressor(max_depth=9, min_samples_leaf=2, min_samples_split=5,
random_state=42)
Score train: 0.8826310919750598
print('score test: ', gs_gb_reg.score (X_test, y_test))
score test: 0.8806625226588694
grid_lgb_reg = {"num_leaves": [7, 14, 21, 28, 31, 50],
"learning_rate": [0.1, 0.03],
"n_estimators": [150,300],
"max_depth": [-1,3,5,9,11],
}
# GridSearch:
gs_lgb_reg = GridSearchCV(estimator = lgb_reg,
param_grid = grid_lgb_reg,
scoring = 'r2',
n_jobs=-1,
cv=10,
verbose=1)
gs_lgb_reg.fit(X_train, y_train)
print('Mejor selección: ', gs_lgb_reg.best_estimator_)
print('Score train: ', gs_lgb_reg.best_score_)
Fitting 10 folds for each of 120 candidates, totalling 1200 fits Mejor selección: LGBMRegressor(n_estimators=300, num_leaves=50, random_state=42) Score train: 0.8815811343287233
# Hyperparameter grid:
grid_cb_reg = {"iterations": [100, 200, 300],
"learning_rate": [0.03, 0.1, 0.5],
"depth": [3, 9],
"random_strength": [0.2, 0.5, 1, 3]
}
# GridSearch:
gs_cb_reg = GridSearchCV(estimator = cb_reg,
param_grid = grid_cb_reg,
scoring = 'r2',
n_jobs=-1,
cv=10,
verbose=1)
gs_cb_reg.fit(X_train, y_train)
print('Mejor selección: ', gs_cb_reg.best_estimator_)
print('Score train: ', gs_cb_reg.best_score_)
Fitting 10 folds for each of 72 candidates, totalling 720 fits 0: learn: 6435.0575113 total: 166ms remaining: 49.5s 1: learn: 5194.1233271 total: 183ms remaining: 27.3s 2: learn: 4614.8593157 total: 200ms remaining: 19.8s 3: learn: 4354.2996171 total: 217ms remaining: 16.1s 4: learn: 4184.3281913 total: 234ms remaining: 13.8s 5: learn: 4090.6489786 total: 251ms remaining: 12.3s 6: learn: 4016.0343241 total: 267ms remaining: 11.2s 7: learn: 3976.2338440 total: 284ms remaining: 10.4s 8: learn: 3920.9469790 total: 303ms remaining: 9.78s 9: learn: 3869.1175106 total: 319ms remaining: 9.24s 10: learn: 3826.5859804 total: 338ms remaining: 8.87s 11: learn: 3793.6420583 total: 368ms remaining: 8.84s 12: learn: 3760.9819132 total: 402ms remaining: 8.88s 13: learn: 3741.1394924 total: 421ms remaining: 8.6s 14: learn: 3711.5111046 total: 444ms remaining: 8.43s 15: learn: 3692.8604979 total: 462ms remaining: 8.21s 16: learn: 3668.8791352 total: 480ms remaining: 8s 17: learn: 3652.5281991 total: 499ms remaining: 7.82s 18: learn: 3614.6105095 total: 518ms remaining: 7.67s 19: learn: 3596.9250794 total: 536ms remaining: 7.51s 20: learn: 3585.2377403 total: 555ms remaining: 7.38s 21: learn: 3569.7823145 total: 573ms remaining: 7.24s 22: learn: 3547.7422642 total: 594ms remaining: 7.16s 23: learn: 3534.3671556 total: 618ms remaining: 7.11s 24: learn: 3514.6886666 total: 640ms remaining: 7.04s 25: learn: 3495.8946048 total: 659ms remaining: 6.94s 26: learn: 3481.6069594 total: 677ms remaining: 6.84s 27: learn: 3470.2079406 total: 695ms remaining: 6.75s 28: learn: 3446.8092016 total: 713ms remaining: 6.67s 29: learn: 3427.0861584 total: 733ms remaining: 6.6s 30: learn: 3412.8999998 total: 757ms remaining: 6.57s 31: learn: 3405.2466042 total: 777ms remaining: 6.51s 32: learn: 3387.7019719 total: 799ms remaining: 6.46s 33: learn: 3377.5416207 total: 824ms remaining: 6.45s 34: learn: 3363.6473267 total: 848ms remaining: 6.42s 35: learn: 3352.5042625 total: 871ms remaining: 6.39s 36: 
learn: 3347.4901265 total: 890ms remaining: 6.33s 37: learn: 3336.4869666 total: 911ms remaining: 6.28s 38: learn: 3322.6411015 total: 931ms remaining: 6.23s 39: learn: 3312.8176748 total: 950ms remaining: 6.18s 40: learn: 3303.7794669 total: 967ms remaining: 6.11s 41: learn: 3292.9941103 total: 985ms remaining: 6.05s 42: learn: 3289.0305148 total: 1s remaining: 5.99s 43: learn: 3278.8806966 total: 1.03s remaining: 5.97s 44: learn: 3261.8821679 total: 1.05s remaining: 5.96s 45: learn: 3257.4616576 total: 1.07s remaining: 5.92s 46: learn: 3248.0766410 total: 1.09s remaining: 5.87s 47: learn: 3233.0625499 total: 1.11s remaining: 5.82s 48: learn: 3227.7758132 total: 1.13s remaining: 5.78s 49: learn: 3216.9585403 total: 1.15s remaining: 5.75s 50: learn: 3205.7267089 total: 1.17s remaining: 5.71s 51: learn: 3202.2572991 total: 1.19s remaining: 5.66s 52: learn: 3196.1935529 total: 1.21s remaining: 5.62s 53: learn: 3192.1681258 total: 1.23s remaining: 5.58s 54: learn: 3184.7819725 total: 1.25s remaining: 5.55s 55: learn: 3175.6385500 total: 1.27s remaining: 5.54s 56: learn: 3169.3637084 total: 1.29s remaining: 5.51s 57: learn: 3162.0195281 total: 1.31s remaining: 5.48s 58: learn: 3154.3393886 total: 1.33s remaining: 5.44s 59: learn: 3148.8559108 total: 1.35s remaining: 5.4s 60: learn: 3141.3330644 total: 1.37s remaining: 5.37s 61: learn: 3132.6465558 total: 1.39s remaining: 5.33s 62: learn: 3126.0904099 total: 1.41s remaining: 5.3s 63: learn: 3116.1633095 total: 1.42s remaining: 5.25s 64: learn: 3108.5382671 total: 1.44s remaining: 5.22s 65: learn: 3099.8123254 total: 1.47s remaining: 5.2s 66: learn: 3092.0807055 total: 1.49s remaining: 5.19s 67: learn: 3084.7899055 total: 1.51s remaining: 5.16s 68: learn: 3077.1775934 total: 1.53s remaining: 5.13s 69: learn: 3070.6885755 total: 1.55s remaining: 5.1s 70: learn: 3062.8860598 total: 1.57s remaining: 5.07s 71: learn: 3059.3199367 total: 1.59s remaining: 5.05s 72: learn: 3053.5898434 total: 1.61s remaining: 5.02s 73: learn: 
3046.7184438 total: 1.63s remaining: 4.98s 74: learn: 3044.4028855 total: 1.65s remaining: 4.95s 75: learn: 3036.5127718 total: 1.67s remaining: 4.92s 76: learn: 3032.6835102 total: 1.69s remaining: 4.89s 77: learn: 3026.6082409 total: 1.71s remaining: 4.88s 78: learn: 3015.8495558 total: 1.74s remaining: 4.86s 79: learn: 3012.6783052 total: 1.76s remaining: 4.83s 80: learn: 3008.2269548 total: 1.77s remaining: 4.8s 81: learn: 3002.1445406 total: 1.79s remaining: 4.77s 82: learn: 2998.6921865 total: 1.81s remaining: 4.74s 83: learn: 2991.6592872 total: 1.83s remaining: 4.71s 84: learn: 2983.0559012 total: 1.85s remaining: 4.69s 85: learn: 2978.8719046 total: 1.87s remaining: 4.66s 86: learn: 2975.1527962 total: 1.89s remaining: 4.63s 87: learn: 2968.5163570 total: 1.91s remaining: 4.61s 88: learn: 2964.2521420 total: 1.94s remaining: 4.59s 89: learn: 2960.0459732 total: 1.96s remaining: 4.57s 90: learn: 2954.5277481 total: 1.98s remaining: 4.54s 91: learn: 2950.2775196 total: 2s remaining: 4.51s 92: learn: 2946.7894058 total: 2.02s remaining: 4.49s 93: learn: 2942.6919742 total: 2.04s remaining: 4.47s 94: learn: 2938.1521919 total: 2.06s remaining: 4.44s 95: learn: 2934.0822594 total: 2.07s remaining: 4.41s 96: learn: 2926.2332088 total: 2.09s remaining: 4.38s 97: learn: 2922.7756868 total: 2.11s remaining: 4.35s 98: learn: 2912.7873207 total: 2.13s remaining: 4.33s 99: learn: 2910.9889520 total: 2.15s remaining: 4.3s 100: learn: 2907.4645973 total: 2.17s remaining: 4.28s 101: learn: 2898.9339915 total: 2.19s remaining: 4.26s 102: learn: 2896.0940306 total: 2.21s remaining: 4.23s 103: learn: 2888.9555715 total: 2.23s remaining: 4.21s 104: learn: 2884.3369760 total: 2.25s remaining: 4.18s 105: learn: 2876.7747319 total: 2.27s remaining: 4.16s 106: learn: 2871.5527228 total: 2.29s remaining: 4.13s 107: learn: 2867.7893712 total: 2.31s remaining: 4.11s 108: learn: 2864.0985664 total: 2.37s remaining: 4.15s 109: learn: 2859.0235025 total: 2.39s remaining: 4.13s 110: 
learn: 2856.1162041 total: 2.42s remaining: 4.11s 111: learn: 2852.3310850 total: 2.43s remaining: 4.09s 112: learn: 2846.0348800 total: 2.46s remaining: 4.07s 113: learn: 2842.5079572 total: 2.48s remaining: 4.04s 114: learn: 2838.7404528 total: 2.5s remaining: 4.02s 115: learn: 2835.7568033 total: 2.52s remaining: 3.99s 116: learn: 2832.3121783 total: 2.54s remaining: 3.97s 117: learn: 2828.2519342 total: 2.56s remaining: 3.94s 118: learn: 2824.9035127 total: 2.58s remaining: 3.92s 119: learn: 2819.7277274 total: 2.6s remaining: 3.9s 120: learn: 2815.5542269 total: 2.62s remaining: 3.88s 121: learn: 2811.6318899 total: 2.64s remaining: 3.85s 122: learn: 2806.3458971 total: 2.65s remaining: 3.82s 123: learn: 2804.1917156 total: 2.67s remaining: 3.8s 124: learn: 2800.0994551 total: 2.69s remaining: 3.77s 125: learn: 2796.9184052 total: 2.71s remaining: 3.75s 126: learn: 2790.4601986 total: 2.73s remaining: 3.72s 127: learn: 2789.5690580 total: 2.75s remaining: 3.7s 128: learn: 2788.4653367 total: 2.77s remaining: 3.67s 129: learn: 2783.5127232 total: 2.79s remaining: 3.65s 130: learn: 2779.9807156 total: 2.81s remaining: 3.63s 131: learn: 2777.1546283 total: 2.83s remaining: 3.61s 132: learn: 2774.3216541 total: 2.86s remaining: 3.59s 133: learn: 2770.1664251 total: 2.88s remaining: 3.56s 134: learn: 2763.5489993 total: 2.89s remaining: 3.54s 135: learn: 2760.8648016 total: 2.91s remaining: 3.51s 136: learn: 2759.5762493 total: 2.93s remaining: 3.49s 137: learn: 2756.5814858 total: 2.95s remaining: 3.46s 138: learn: 2753.1751000 total: 2.97s remaining: 3.44s 139: learn: 2749.3518160 total: 2.99s remaining: 3.41s 140: learn: 2745.8234843 total: 3.01s remaining: 3.4s 141: learn: 2739.0352353 total: 3.03s remaining: 3.37s 142: learn: 2735.6354369 total: 3.05s remaining: 3.35s 143: learn: 2729.7815592 total: 3.07s remaining: 3.33s 144: learn: 2725.8207388 total: 3.09s remaining: 3.31s 145: learn: 2719.6153255 total: 3.11s remaining: 3.28s 146: learn: 2715.7367810 
total: 3.13s remaining: 3.26s 147: learn: 2712.6892948 total: 3.15s remaining: 3.24s 148: learn: 2708.0471969 total: 3.17s remaining: 3.21s 149: learn: 2703.8629867 total: 3.19s remaining: 3.19s 150: learn: 2700.0992582 total: 3.21s remaining: 3.17s 151: learn: 2698.3763745 total: 3.23s remaining: 3.15s 152: learn: 2695.4965291 total: 3.26s remaining: 3.13s 153: learn: 2693.7759263 total: 3.28s remaining: 3.11s 154: learn: 2692.0653007 total: 3.3s remaining: 3.08s 155: learn: 2690.4429784 total: 3.31s remaining: 3.06s 156: learn: 2683.0178345 total: 3.33s remaining: 3.04s 157: learn: 2680.3969856 total: 3.35s remaining: 3.01s 158: learn: 2678.8322531 total: 3.37s remaining: 2.99s 159: learn: 2674.7737978 total: 3.39s remaining: 2.97s 160: learn: 2670.8520986 total: 3.41s remaining: 2.94s 161: learn: 2666.5752844 total: 3.43s remaining: 2.92s 162: learn: 2662.0612348 total: 3.45s remaining: 2.9s 163: learn: 2658.9172892 total: 3.47s remaining: 2.88s 164: learn: 2656.3127579 total: 3.49s remaining: 2.85s 165: learn: 2654.2956245 total: 3.51s remaining: 2.83s 166: learn: 2650.8342006 total: 3.52s remaining: 2.81s 167: learn: 2647.8452631 total: 3.54s remaining: 2.79s 168: learn: 2645.4351021 total: 3.56s remaining: 2.76s 169: learn: 2643.4199372 total: 3.58s remaining: 2.74s 170: learn: 2637.8088384 total: 3.6s remaining: 2.72s 171: learn: 2634.8963781 total: 3.62s remaining: 2.7s 172: learn: 2631.4148077 total: 3.64s remaining: 2.67s 173: learn: 2628.9342179 total: 3.67s remaining: 2.65s 174: learn: 2626.9344642 total: 3.69s remaining: 2.63s 175: learn: 2624.1794277 total: 3.71s remaining: 2.61s 176: learn: 2621.0677407 total: 3.72s remaining: 2.59s 177: learn: 2619.0864622 total: 3.75s remaining: 2.57s 178: learn: 2616.4262417 total: 3.77s remaining: 2.54s 179: learn: 2611.8023591 total: 3.79s remaining: 2.52s 180: learn: 2607.4758502 total: 3.8s remaining: 2.5s 181: learn: 2604.8386560 total: 3.83s remaining: 2.48s 182: learn: 2602.7118129 total: 3.84s remaining: 
(CatBoost verbose training log truncated: iterations 183–299, learn RMSE falling from 2600.01 to 2323.25, total fit time 6.18s)
Best estimator: <catboost.core.CatBoostRegressor object at 0x000002453A0B7B80> Score train: 0.8863684085128496
print('score test: ', gs_cb_reg.score(X_test, y_test))
score test: 0.8865574832900184
mse = mean_squared_error(y_true=y_test,
                         y_pred=gs_cb_reg.predict(X_test))
rmse = np.sqrt(mse)
print("The CatBoost Regressor has a test-set RMSE of " + str(rmse))
The CatBoost Regressor has a test-set RMSE of 3216.17001774777
grid_xgb_reg = {"n_estimators": [100, 300],
                "max_depth": [3, 9, 11],
                "subsample": [0.03, 0.1, 0.5],
                "learning_rate": [0.03, 0.1],
                "colsample_bytree": [0.3, 0.7]
                }
# GridSearch:
gs_xgb_reg = GridSearchCV(estimator=xgb_reg,
                          param_grid=grid_xgb_reg,
                          scoring='r2',
                          n_jobs=-1,
                          cv=10,
                          verbose=1)
gs_xgb_reg.fit(X_train, y_train)
print('Best estimator: ', gs_xgb_reg.best_estimator_)
print('Train score: ', gs_xgb_reg.best_score_)
Fitting 10 folds for each of 72 candidates, totalling 720 fits
Best estimator:  XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.1, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=11, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=300, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=42, ...)
Train score:  0.8950928104957875
from sklearn.metrics import r2_score
r2 = r2_score(y_true=y_test,
              y_pred=gs_xgb_reg.predict(X_test))
mse = mean_squared_error(y_true=y_test,
                         y_pred=gs_xgb_reg.predict(X_test))
rmse = np.sqrt(mse)
print("The XGBoost Regressor has a test-set R2 of " + str(r2))
print("The XGBoost Regressor has a test-set RMSE of " + str(rmse))
The XGBoost Regressor has a test-set R2 of 0.8971292992427436
The XGBoost Regressor has a test-set RMSE of 3062.6468500761334
conjunto_de_GridSearchs = [gs_ab_reg,
                           gs_gb_reg,
                           gs_lgb_reg,
                           gs_cb_reg,
                           gs_xgb_reg]
best_score = [gs_ab_reg.best_score_, gs_gb_reg.best_score_, gs_lgb_reg.best_score_,
              gs_cb_reg.best_score_, gs_xgb_reg.best_score_]
best_score
# The highest CV score corresponds to the XGBoost Regressor.
[0.5308523379539066, 0.8826310919750598, 0.8815811343287233, 0.8863684085128496, 0.8950928104957875]
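The bare list of scores is hard to match to its models at a glance. As a small sketch (model names paired with the CV scores printed above, rounded to four decimals), pandas can tabulate and rank them:

```python
import pandas as pd

# Best cross-validated R2 of each GridSearch, copied from the outputs above
scores = {
    'AdaBoost': 0.5309,
    'GradientBoosting': 0.8826,
    'LightGBM': 0.8816,
    'CatBoost': 0.8864,
    'XGBoost': 0.8951,
}
comparison = (pd.Series(scores, name='best_cv_r2')
                .sort_values(ascending=False)
                .to_frame())
print(comparison)
print('Winner:', comparison.index[0])
```

The same table could of course be built directly from `conjunto_de_GridSearchs` via each object's `best_score_`.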
# Pick the GridSearch with the highest CV score (index 4, the XGBoost Regressor)
mejor_puntuación_GridSearch = conjunto_de_GridSearchs[np.argmax(best_score)].best_estimator_
mejor_puntuación_GridSearch
XGBRegressor(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=0.7, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=0.1, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=11, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=300, n_jobs=None, num_parallel_tree=None,
predictor=None, random_state=42, ...)
![]() | Plot the importance of the 10 most important features for the best model you find. Which are those features? Are they the same and in the same proportion than for `dt`? Why/why not? |
# Keep only the best fitted estimator instead of the full GridSearchCV wrapper
gs_xgb_reg = gs_xgb_reg.best_estimator_
gs_xgb_reg.feature_importances_
array([0.03375672, 0.01227906, 0.00611096, 0.00232226, 0.00749108,
0.0086685 , 0.00890077, 0.00258356, 0.00997281, 0.0057821 ,
0.00541655, 0.00863947, 0.0071637 , 0.00497436, 0.00584285,
0.00225383, 0.00507726, 0.00586527, 0.00382546, 0.00896826,
0.00882643, 0.00705131, 0.0015811 , 0.01504467, 0.00641449,
0.00246683, 0.00970755, 0.01240706, 0.00198961, 0.00934782,
0.0074902 , 0.00280123, 0.00522738, 0.00567962, 0.01483084,
0.00396883, 0.00428546, 0.0295811 , 0.00953096, 0.00407492,
0.00774525, 0.00310166, 0.01081623, 0.0040716 , 0.00367444,
0.00620251, 0.00837924, 0.02075411, 0.01137489, 0.00929384,
0.06406159, 0.00265261, 0.02235937, 0.08456504, 0.00649089,
0.06493517, 0.0038346 , 0.027018 , 0.00761793, 0.00659274,
0.00420821, 0.00419161, 0.00172993, 0.00273793, 0.00705998,
0.00626196, 0.0047779 , 0.00672948, 0.00997643, 0.01427866,
0.08461848, 0.00823848, 0.00791812, 0.01320508, 0.00768335,
0.00474341, 0.00507872, 0.00465267, 0.01084319, 0.00522084,
0.00981415, 0.01178969, 0.02002072, 0.00870402, 0.0029592 ,
0.00309086, 0.00249973, 0.00271275, 0.00467182, 0.00298751,
0.00227844, 0.00488396, 0.00278622, 0.00270417, 0.00292249,
0.00386328, 0.00341268], dtype=float32)
sort = gs_xgb_reg.feature_importances_.argsort()
features_importance = pd.DataFrame(columns = X_train.columns[sort])
features_importance
| manufacturer_land rover | title_status_missing | manufacturer_mini | manufacturer_harley-davidson | paint_color_grey | manufacturer_alfa-romeo | manufacturer_mazda | paint_color_blue | manufacturer_buick | cylinders_5 | ... | type_truck | cylinders_10 | cylinders_6 | fuel_gas | manufacturer_tesla | age | cylinders_4 | fuel_diesel | cylinders_8 | drive_fwd |
|---|
0 rows × 97 columns
features_importance.loc[0] = gs_xgb_reg.feature_importances_[sort]
features_importance
| manufacturer_land rover | title_status_missing | manufacturer_mini | manufacturer_harley-davidson | paint_color_grey | manufacturer_alfa-romeo | manufacturer_mazda | paint_color_blue | manufacturer_buick | cylinders_5 | ... | type_truck | cylinders_10 | cylinders_6 | fuel_gas | manufacturer_tesla | age | cylinders_4 | fuel_diesel | cylinders_8 | drive_fwd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.003 | 0.003 | ... | 0.020 | 0.021 | 0.022 | 0.027 | 0.030 | 0.034 | 0.064 | 0.065 | 0.085 | 0.085 |
1 rows × 97 columns
# Drop all but the 10 most important features (97 columns in total, so drop the first 87)
Top_fi = features_importance.drop(features_importance.columns[0:87], axis='columns')
Top_fi
| type_truck | cylinders_10 | cylinders_6 | fuel_gas | manufacturer_tesla | age | cylinders_4 | fuel_diesel | cylinders_8 | drive_fwd | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.020 | 0.021 | 0.022 | 0.027 | 0.030 | 0.034 | 0.064 | 0.065 | 0.085 | 0.085 |
Top_fi.plot.bar(fontsize=12, rot=35,
                title='Top XGBoost Features According to Their Importance',
                xlabel='XGBoost Regressor Features',
                ylabel='Importance'
                )
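The `argsort` plus column-dropping dance above works, but it is brittle if the feature count ever changes. A sketch of a more direct route using `Series.nlargest` (shown on toy stand-ins for `gs_xgb_reg.feature_importances_` and `X_train.columns`):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the fitted model's importances and the training columns
rng = np.random.default_rng(0)
importances = rng.random(12)
columns = [f'feat_{i}' for i in range(12)]

# Top 10 importances in one step, sorted ascending for plotting
top10 = pd.Series(importances, index=columns).nlargest(10).sort_values()
print(top10)
```

With the real model, `top10.plot.barh()` then reproduces the bar chart above without hard-coding how many columns to drop.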
![]() | Compute the predictions for your best model, and plot them against the real prices in `Y_test`. Are you surprised by the results? Do the same for `Y_train` and see if the behavior has degraded considerably or not. |
predXgboost_train = gs_xgb_reg.predict(X_train)
predXgboost_train
array([19479.406 , 7969.9023, 7970.1777, ..., 3756.1501, 6835.316 ,
39225.723 ], dtype=float32)
predXgboost_train_df = X_train.copy()  # copy, so adding columns does not mutate X_train
predXgboost_train_df["predXgboost_train"] = predXgboost_train
predXgboost_train_df.head()
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_custom | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_train | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29645 | 4 | 72500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 19479.406 |
| 68355 | 17 | 136000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7969.902 |
| 52304 | 8 | 109526 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7970.178 |
| 62400 | 14 | 184850 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5868.347 |
| 11259 | 4 | 70000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30769.883 |
5 rows × 98 columns
predXgboost_train_df['price'] = y_train
predXgboost_train_df.head()
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_train | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29645 | 4 | 72500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 19479.406 | 24495 |
| 68355 | 17 | 136000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 7969.902 | 6500 |
| 52304 | 8 | 109526 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7970.178 | 7300 |
| 62400 | 14 | 184850 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5868.347 | 3995 |
| 11259 | 4 | 70000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30769.883 | 31900 |
5 rows × 99 columns
predXgboost_train_df["diferencia"] = predXgboost_train_df["predXgboost_train"] - y_train
predXgboost_train_df = predXgboost_train_df.sort_values(by="price", ascending=False)
predXgboost_train_df
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_train | price | diferencia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72777 | 0 | 4700 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 49176.090 | 50000 | -823.910 |
| 21483 | 4 | 57100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47417.379 | 50000 | -2582.621 |
| 66009 | 4 | 22500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46287.512 | 50000 | -3712.488 |
| 71181 | 1 | 7900 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47532.562 | 50000 | -2467.438 |
| 55346 | 5 | 2236 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 46100.117 | 50000 | -3899.883 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 47868 | 7 | 44995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2319.616 | 500 | 1819.616 |
| 47918 | 7 | 44995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2319.616 | 500 | 1819.616 |
| 14672 | 17 | 192000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4915.553 | 500 | 4415.553 |
| 26146 | 21 | 140007 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 749.533 | 500 | 249.533 |
| 38105 | 19 | 197000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 750.442 | 500 | 250.442 |
62304 rows × 100 columns
from bokeh.io import output_notebook, show
from bokeh.plotting import Figure  # NB: bokeh >= 3.0 removed Figure; use the lowercase figure() factory there
output_notebook()

train = Figure(title='TRAIN: Actual price vs Predictions', width=800, height=400,
               tools=['pan', 'xwheel_zoom', 'hover', 'reset', 'help'])
train.scatter(x=predXgboost_train, y=y_train, size=10, fill_color='#DA1313')
train.xaxis.axis_label = 'Predictions'  # x carries the predictions, y the real prices
train.yaxis.axis_label = 'Real price'
show(train)
d = predXgboost_train_df["diferencia"]
(d - d.mean()).abs().mean()  # mean absolute deviation; Series.mad() was removed in pandas 2.0
1195.1600614702165
predXgboost_test = gs_xgb_reg.predict(X_test)
predXgboost_test
array([ 4351.176, 11426.97 , 27498.592, ..., 11860.888, 16020.244,
7016.037], dtype=float32)
predXgboost_test_df = X_test.copy()  # copy, so adding columns does not mutate X_test
predXgboost_test_df['predXgboost_test'] = predXgboost_test
predXgboost_test_df.head()
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_custom | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_test | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27026 | 11 | 159427 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4351.176 |
| 31692 | 10 | 153164 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11426.970 |
| 1520 | 5 | 151318 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 27498.592 |
| 59465 | 7 | 75700 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 13862.031 |
| 67441 | 17 | 191588 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4214.032 |
5 rows × 98 columns
predXgboost_test_df['price'] = y_test
predXgboost_test_df.head()
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_green | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_test | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27026 | 11 | 159427 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4351.176 | 5890 |
| 31692 | 10 | 153164 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11426.970 | 9995 |
| 1520 | 5 | 151318 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 27498.592 | 23999 |
| 59465 | 7 | 75700 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 13862.031 | 11900 |
| 67441 | 17 | 191588 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4214.032 | 3800 |
5 rows × 99 columns
predXgboost_test_df['diferencia'] = predXgboost_test_df['predXgboost_test'] - y_test
predXgboost_test_df = predXgboost_test_df.sort_values(by='price', ascending=False)
predXgboost_test_df
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_test | price | diferencia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53007 | 0 | 2315 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 35676.855 | 50000 | -14323.145 |
| 42034 | 0 | 1800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42111.445 | 50000 | -7888.555 |
| 63242 | 4 | 136600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42574.672 | 50000 | -7425.328 |
| 70819 | 3 | 10500 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30308.906 | 50000 | -19691.094 |
| 25341 | 2 | 47000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 32528.229 | 49999 | -17470.771 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48518 | 7 | 44995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2319.616 | 500 | 1819.616 |
| 41212 | 2 | 2520 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 31718.785 | 500 | 31218.785 |
| 74640 | 7 | 44995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2319.616 | 500 | 1819.616 |
| 68171 | 18 | 205000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2910.247 | 500 | 2410.247 |
| 36515 | 17 | 160000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 930.583 | 500 | 430.583 |
15576 rows × 100 columns
predXgboost_test_df.reset_index(drop=True, inplace=True)
predXgboost_test_df
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_test | price | diferencia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2315 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 35676.855 | 50000 | -14323.145 |
| 1 | 0 | 1800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42111.445 | 50000 | -7888.555 |
| 2 | 4 | 136600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42574.672 | 50000 | -7425.328 |
| 3 | 3 | 10500 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30308.906 | 50000 | -19691.094 |
| 4 | 2 | 47000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 32528.229 | 49999 | -17470.771 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15571 | 7 | 44995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2319.616 | 500 | 1819.616 |
| 15572 | 2 | 2520 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 31718.785 | 500 | 31218.785 |
| 15573 | 7 | 44995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2319.616 | 500 | 1819.616 |
| 15574 | 18 | 205000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2910.247 | 500 | 2410.247 |
| 15575 | 17 | 160000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 930.583 | 500 | 430.583 |
15576 rows × 100 columns
test = Figure(title='TEST: Actual price vs Predictions', width=800, height=400,
              tools=['pan', 'xwheel_zoom', 'hover', 'reset', 'help'])
test.scatter(x=predXgboost_test, y=y_test, size=10, fill_color='#13DA2B')
test.xaxis.axis_label = 'Predictions'  # x carries the predictions, y the real prices
test.yaxis.axis_label = 'Real price'
show(test)
# Predictions and real prices show a strong positive correlation, as in the train set
d = predXgboost_test_df["diferencia"]
(d - d.mean()).abs().mean()  # mean absolute deviation of the test residuals
1842.4206098594052
def mad(s):
    """Mean absolute deviation (Series.mad() was removed in pandas 2.0)."""
    return (s - s.mean()).abs().mean()

round(mad(predXgboost_test_df["diferencia"]) - mad(predXgboost_train_df["diferencia"]), 2)
647.26
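A caveat on `.mad()`: it measures the dispersion of the residuals around their own mean, not around zero, so it is not the mean absolute error; it was also removed in pandas 2.0. A sketch on a few sample residuals (hypothetical values) showing both quantities computed explicitly:

```python
import pandas as pd

resid = pd.Series([-823.9, -2582.6, 1819.6, 249.5, 250.4])  # sample 'diferencia' values

mad = (resid - resid.mean()).abs().mean()  # what Series.mad() used to return
mae = resid.abs().mean()                   # mean absolute error, i.e. deviation around zero
print(mad, mae)
```

For model evaluation the MAE is usually the more meaningful of the two, since residuals are already deviations from the target.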
![]() | Find some examples of the following: 1. Cars whose price are very underestimated by the model. 2. Cars whose price are very overestimated by the model. 3. Cars whose price are almost predicted exactly by the model. Can you figure out the reasons behind 1 and 2? |
predXgboost_test_df
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_test | price | diferencia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2315 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 35676.855 | 50000 | -14323.145 |
| 1 | 0 | 1800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42111.445 | 50000 | -7888.555 |
| 2 | 4 | 136600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42574.672 | 50000 | -7425.328 |
| 3 | 3 | 10500 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30308.906 | 50000 | -19691.094 |
| 4 | 2 | 47000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 32528.229 | 49999 | -17470.771 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15571 | 7 | 44995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2319.616 | 500 | 1819.616 |
| 15572 | 2 | 2520 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 31718.785 | 500 | 31218.785 |
| 15573 | 7 | 44995 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2319.616 | 500 | 1819.616 |
| 15574 | 18 | 205000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2910.247 | 500 | 2410.247 |
| 15575 | 17 | 160000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 930.583 | 500 | 430.583 |
15576 rows × 100 columns
# Underestimated cars: residual below minus one test RMSE (≈ 3062.64)
infra_df = predXgboost_test_df[predXgboost_test_df['diferencia'] < -3062.64]
infra_df
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_test | price | diferencia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2315 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 35676.855 | 50000 | -14323.145 |
| 1 | 0 | 1800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42111.445 | 50000 | -7888.555 |
| 2 | 4 | 136600 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42574.672 | 50000 | -7425.328 |
| 3 | 3 | 10500 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30308.906 | 50000 | -19691.094 |
| 4 | 2 | 47000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 32528.229 | 49999 | -17470.771 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11623 | 11 | 181626 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1892.651 | 5500 | -3607.349 |
| 11633 | 21 | 214000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1802.560 | 5500 | -3697.440 |
| 12073 | 21 | 250000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1874.540 | 5000 | -3125.460 |
| 12320 | 28 | 85235 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 701.932 | 4990 | -4288.068 |
| 12838 | 21 | 214036 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1083.462 | 4500 | -3416.538 |
1268 rows × 100 columns
# Overestimated cars: residual above one test RMSE (≈ 3062.64)
sobre_df = predXgboost_test_df[predXgboost_test_df['diferencia'] > 3062.64]
sobre_df
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_test | price | diferencia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48 | 2 | 35499 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 52207.832 | 46995 | 5212.832 |
| 137 | 3 | 39665 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 51740.793 | 44750 | 6990.793 |
| 200 | 2 | 55259 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 46726.535 | 42900 | 3826.535 |
| 211 | 2 | 40000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 47502.191 | 42500 | 5002.191 |
| 424 | 1 | 21000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 42089.918 | 38000 | 4089.918 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15547 | 3 | 23829 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 34347.172 | 507 | 33840.172 |
| 15548 | 2 | 19757 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 34656.332 | 507 | 34149.332 |
| 15552 | 1 | 41970 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 15280.449 | 500 | 14780.449 |
| 15558 | 29 | 200000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 5175.416 | 500 | 4675.416 |
| 15572 | 2 | 2520 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 31718.785 | 500 | 31218.785 |
1341 rows × 100 columns
# Almost exact predictions: residual between 0 and 100 (NB: this keeps only slight overestimates;
# filter on predXgboost_test_df['diferencia'].abs() < 100 to include slight underestimates as well)
casi_df = predXgboost_test_df[(predXgboost_test_df['diferencia'] > 0) & (predXgboost_test_df['diferencia'] < 100)]
casi_df
| age | odometer | manufacturer_acura | manufacturer_alfa-romeo | manufacturer_aston-martin | manufacturer_audi | manufacturer_bmw | manufacturer_buick | manufacturer_cadillac | manufacturer_chevrolet | ... | paint_color_grey | paint_color_orange | paint_color_purple | paint_color_red | paint_color_silver | paint_color_white | paint_color_yellow | predXgboost_test | price | diferencia | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 85 | 2 | 165053 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 45939.434 | 45900 | 39.434 |
| 86 | 2 | 165053 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 45939.434 | 45900 | 39.434 |
| 87 | 2 | 165053 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 45939.434 | 45900 | 39.434 |
| 89 | 2 | 165053 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 45939.434 | 45900 | 39.434 |
| 158 | 2 | 31216 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 44067.941 | 43990 | 77.941 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15268 | 14 | 178000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1594.271 | 1500 | 94.271 |
| 15274 | 20 | 233449 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1576.058 | 1500 | 76.058 |
| 15294 | 15 | 260500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1446.109 | 1400 | 46.109 |
| 15388 | 20 | 210000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1051.914 | 1000 | 51.914 |
| 15448 | 2 | 41000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1010.621 | 999 | 11.621 |
420 rows × 100 columns
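Instead of filtering on a hard-coded ±RMSE threshold, the most extreme and the best-predicted cases can also be pulled out directly with `nsmallest`/`nlargest` on the residual column. A sketch on a toy residual table standing in for `predXgboost_test_df`:

```python
import pandas as pd

# Toy stand-in with a 'price' and a residual ('diferencia') column
df = pd.DataFrame({'price':      [50000.0,   500.0, 23999.0,  5890.0, 500.0],
                   'diferencia': [-19691.1, 31218.8,  -500.4, -1538.8, 430.6]})

most_under = df.nsmallest(2, 'diferencia')  # prediction far below the real price
most_over  = df.nlargest(2, 'diferencia')   # prediction far above the real price
closest    = df.loc[df['diferencia'].abs().nsmallest(2).index]  # near-exact predictions
```

This yields a fixed number of examples per category regardless of how the residual distribution shifts between models.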
![]() | The above missing-data strategy is too restrictive. Some columns must stay as they are, since we have no way of inferring `price`, `year`, `odometer`, `condition`, `title_status` and `paint_color`; but there is another column in the full CSV version (`model`, the specific model of the car) from which we can infer some specific information. That is, once we know `model`, we can try to infer its `manufacturer`, `cylinders`, `fuel`, `transmission`, `drive` and `type` from the rest of the dataset. For example, if we find `model='corsa 1.4d'` we can infer that `manufacturer='opel'` and `fuel='diesel'`. Can you implement this strategy? How many more rows do you obtain now? Repeat the train/test split and the model training with this extended dataset. Can you improve on previous results? By how much? |
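A possible sketch of that imputation strategy (toy data; the column names follow the assignment's CSV, and filling with the per-`model` mode is a simplifying assumption — a real implementation might require a minimum group size before trusting the mode):

```python
import pandas as pd

# Toy rows mimicking the full CSV: some categorical values are missing
cars = pd.DataFrame({
    'model': ['corsa 1.4d', 'corsa 1.4d', 'corsa 1.4d', 'civic'],
    'manufacturer': ['opel', None, 'opel', 'honda'],
    'fuel': ['diesel', 'diesel', None, 'gas'],
})

def infer_from_model(df, cols):
    """Fill missing values in `cols` with the most frequent value among rows sharing the same `model`."""
    out = df.copy()
    for col in cols:
        per_model_mode = (out.groupby('model')[col]
                             .transform(lambda s: s.mode().iloc[0] if not s.mode().empty else None))
        out[col] = out[col].fillna(per_model_mode)
    return out

filled = infer_from_model(cars, ['manufacturer', 'fuel'])
```

Rows that would previously have been dropped for a missing `manufacturer` or `fuel` can now be kept, which is where the extra training data comes from.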
![]() | Besides `model`, there are yet more features in that full CSV file that can be loaded: * `url`: URL with the original second-hand car ad. * `image_url`: URL with a photo of the car. * `region`: region in the United States where the car was being sold. * `region_url`: URL with Craigslist link of that region. * `vin`: Vehicle Identification Number of the car. * `description`: the text on the car ad itself (kind of free text). * `county`: US county where the car was being sold. * `state`: US state where the car was being sold. * `lat`: exact latitude of the car. * `lon`: exact longitude of the car. Which of these would you use to build better models? Try to use any of these and see if you can further improve (more points the more you try and the more you improve!). |
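Of the listed columns, `state`/`region` (local market), `lat`/`lon` and `description` carry plausible price signal, whereas the `url`-type columns and `vin` are essentially row identifiers. A minimal sketch (toy, hypothetical values) of turning some of them into model-ready features:

```python
import pandas as pd

# Hypothetical rows for the extra columns of the full CSV
extra = pd.DataFrame({
    'state': ['ca', 'tx', 'ca'],
    'description': ['clean title, one owner', 'needs work', 'low miles, garage kept'],
    'lat': [34.05, 29.76, 37.77],
    'lon': [-118.24, -95.37, -122.42],
})

features = pd.concat([
    pd.get_dummies(extra['state'], prefix='state'),     # one-hot market location
    extra['description'].str.len().rename('desc_len'),  # crude ad-length signal
    extra[['lat', 'lon']],                              # raw coordinates; tree ensembles can split on them directly
], axis=1)
```

`description` would reward a proper bag-of-words or TF-IDF treatment, but even the raw ad length is a cheap feature to try first.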